ICDL crawling

ICDL crawling is an open, distributed web crawling technology based on the Website Parse Template (WPT) format.

What is Website Parse Template?

Website Parse Template (WPT) is an open, XML-based format that describes the HTML structure of web pages. The WPT format allows web crawlers to generate Semantic Web RDF triples for web pages. WPT is compatible with the existing Semantic Web concepts defined by the W3C (RDF, OWL) and with the UNL specifications.
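The idea of turning a structural template into triples can be illustrated with a minimal sketch. The template below is a hypothetical, simplified stand-in and not the real WPT schema: the element name `rule`, the attributes `tag`, `css_class` and `predicate`, and the function `extract_triples` are all assumptions made for illustration. Each rule maps an HTML element to a predicate, and the text of each matching element becomes the object of a (page URL, predicate, text) triple.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical, simplified template. This is NOT the real WPT schema.
TEMPLATE_XML = """
<template>
  <rule tag="h1" predicate="dc:title"/>
  <rule tag="p" css_class="summary" predicate="dc:description"/>
</template>
"""


def load_rules(template_xml):
    """Return a list of (tag, css_class or None, predicate) tuples."""
    root = ET.fromstring(template_xml)
    return [(r.get("tag"), r.get("css_class"), r.get("predicate"))
            for r in root.findall("rule")]


class TripleExtractor(HTMLParser):
    """Collects (subject, predicate, text) triples for elements matching a rule."""

    def __init__(self, subject, rules):
        super().__init__()
        self.subject = subject
        self.rules = rules
        self.triples = []
        self._active = None  # (tag, predicate) of the element being captured
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            return  # already capturing; nested tags contribute text only
        classes = (dict(attrs).get("class") or "").split()
        for rule_tag, css_class, predicate in self.rules:
            if tag == rule_tag and (css_class is None or css_class in classes):
                self._active = (tag, predicate)
                self._text = []
                break

    def handle_data(self, data):
        if self._active is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._active is not None and tag == self._active[0]:
            # Collapse runs of whitespace before emitting the triple.
            text = " ".join("".join(self._text).split())
            if text:
                self.triples.append((self.subject, self._active[1], text))
            self._active = None


def extract_triples(url, html, template_xml):
    parser = TripleExtractor(url, load_rules(template_xml))
    parser.feed(html)
    parser.close()
    return parser.triples
```

A real crawler would serialize the resulting triples as RDF; the sketch keeps them as plain tuples and does not handle nested elements of the same tag.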

Distributed ICDL crawling

ICDL crawling parses the content of websites according to the HTML structure templates described in WPT files.

Distributed crawling is carried out by an open source client application installed on volunteers’ personal computers (PCs). After authentication, the application registers each PC as a distributed crawling node. The crawler periodically receives tasks from the management console to download specified websites, parse their content and submit the results to the parsed content storage. Crawling runs only while the user’s computer is idle.
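The node workflow described above can be sketched as a single polling step. This is an illustration under stated assumptions, not the actual client: the names `Task`, `CrawlerNode` and `run_once` are hypothetical, and the management console, downloader, parser, storage and idle check are passed in as plain callables so that the sketch stays independent of any real protocol.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Task:
    """Hypothetical task record handed out by the management console."""
    task_id: int
    url: str


@dataclass
class CrawlerNode:
    """Hypothetical crawling node; every collaborator is an injected callable."""
    node_id: str
    next_task: Callable[[str], Optional[Task]]  # ask the console for work
    download: Callable[[str], str]              # fetch a page as HTML text
    parse: Callable[[str, str], Any]            # apply a template to the HTML
    submit: Callable[[str, int, Any], None]     # store the parse result
    is_idle: Callable[[], bool]                 # True when the PC is idle

    def run_once(self) -> bool:
        """Process at most one task; return True if a result was submitted."""
        if not self.is_idle():
            return False  # never crawl while the user is active
        task = self.next_task(self.node_id)
        if task is None:
            return False  # the console has no work for this node
        html = self.download(task.url)
        result = self.parse(task.url, html)
        self.submit(self.node_id, task.task_id, result)
        return True
```

A client built this way would call `run_once` periodically, for example from a timer, and do nothing whenever the idle check fails.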

To increase the accuracy of the crawling results, the management console compares the parse results submitted by several crawlers. The results can be stored for use by thematic and general search engines that apply different search algorithms, such as Google, Live, Yahoo! and Froogle.
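The text does not say how the management console compares results. One plausible approach, shown here purely as an assumption, is a majority vote: a triple is accepted only if more than a given fraction of the nodes that parsed the same page reported it. The function name `reconcile` and the parameter `min_agreement` are hypothetical.

```python
from collections import Counter


def reconcile(results_by_node, min_agreement=0.5):
    """Keep triples reported by more than `min_agreement` of the nodes.

    results_by_node maps a node id to the iterable of triples that node
    submitted for the same page. Majority voting is an assumed strategy,
    not one documented for ICDL crawling.
    """
    if not results_by_node:
        return set()
    counts = Counter()
    for triples in results_by_node.values():
        counts.update(set(triples))  # count each triple once per node
    threshold = min_agreement * len(results_by_node)
    return {triple for triple, n in counts.items() if n > threshold}
```

With three nodes and the default threshold, a triple must be reported by at least two nodes to be kept, so a single faulty or malicious node cannot alter the stored result on its own.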
